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CN Abstract 

We show how the Hamiltonian Monte Carlo algorithm can sometimes be speeded up by 
"splitting" the Hamiltonian in a way that allows much of the movement around the state 
space to be done at low computational cost. One context where this is possible is when the 
log density of the distribution of interest (the potential energy function) can be written as 
the log of a Gaussian density, which is a quadratic function, plus a slowly varying function. 
Hamiltonian dynamics for quadratic energy functions can be analytically solved. With the 
splitting technique, only the slowly- varying part of the energy needs to be handled numer- 
ically, and this can be done with a larger stepsize (and hence fewer steps) than would be 
necessary with a direct simulation of the dynamics. Another context where splitting helps is 
when the most important terms of the potential energy function and its gradient can be eval- 
uated quickly, with only a slowly-varying part requiring costly computations. With splitting, 
^ the quick portion can be handled with a small stepsize, while the costly portion uses a larger 

^ stepsize. We show that both of these splitting approaches can reduce the computational cost 

of sampling from the posterior distribution for a logistic regression model, using either a 
Gaussian approximation centered on the posterior mode, or a Hamiltonian split into a term 
\^ that depends on only a small number of critical cases, and another term that involves the 

larger number of cases whose influence on the posterior distribution is small. Supplemental 
materials for this paper are available online. 
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1 Introduction 



The simple Metropolis algorithm (Metropolis et al., 1953) is often effective at exploring low- 



dimensional distributions, but it can be very inefficient for complex, high-dimensional distributions 
— successive states may exhibit high autocorrelation, due to the random walk nature of the 
movement. Faster exploration can be obtained using Hamiltonian Monte Carlo, which was first 
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introduced by Duane et al. ( 1987), who called it "hybrid Monte Carlo" , and which has been recently 



reviewed by Neal (2010). Hamiltonian Monte Carlo (HMC) reduces the random walk behavior 
of Metropolis by proposing states that are distant from the current state, but nevertheless have 
a high probability of acceptance. These distant proposals are found by numerically simulating 
Hamiltonian dynamics for some specified amount of fictitious time. 

For this simulation to be reasonably accurate (as required for a high acceptance probability), 
the stepsize used must be suitably small. This stepsize determines the number of steps needed to 
produce the proposed new state. Since each step of this simulation requires a costly evaluation of 
the gradient of the log density, the stepsize is the main determinant of computational cost. 



In this paper, we show how the technique of "splitting" the Hamiltonian (Leimkuhler and 



Reich 


2004; 


Neal 


2010 



for Hamiltonian Monte Carlo. In our approach, splitting separates the Hamiltonian, and con- 
sequently the simulation of the dynamics, into two parts. We discuss two contexts in which 
one of these parts can capture most of the rapid variation in the energy function, but is com- 
putationally cheap. Simulating the other, slowly- varying, part requires costly steps, but can 
use a large stepsize. The result is that fewer costly gradient evaluations are needed to pro- 
duce a distant proposal. We illustrate these splitting methods using logistic regression mod- 
els for classification problems. Computer programs for our methods are publicly available from 
|http: //www. ics -uci . edu/~babaks/Site/Codes .htmll 



Before discussing the splitting technique, we provide a brief overview of HMC. (See Neal 



2010, for an extended review of HMC.) To begin, we briefly discuss a physical interpretation of 



Hamiltonian dynamics. Consider a frictionless puck that slides on a surface of varying height. 
The state space of this dynamical system consists of its position, denoted by the vector q, and its 
momentum (mass, m, times velocity, v), denoted by a vector p. Based on q and p, we define the 
potential energy, U{q), and the kinetic energy, K{p), of the puck. U (g) is proportional to the height 
of the surface at position q. The kinetic energy is m\v\^/2, so K{p) = |pp/(2m). As the puck 
moves on an upward slope, its potential energy increases while its kinetic energy decreases, until 
it becomes zero. At that point, the puck slides back down, with its potential energy decreasing 
and its kinetic energy increasing. 

The above dynamic system can be represented by a function of q and p known as the Hamil- 
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tonian, which for HMC is usually defined as the sum of a potential energy, U, depending only on 
the position and a kinetic energy, K, depending only on the momentum: 

H{q,p) ^ U{q) + Kip) (1) 

The partial derivatives of H{q, p) determine how q and p change over time, according to Hamilton's 
equations: 
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(2) 



These equations define a mapping Tg from the state at some time t to the state at time t + s. 

We can use Hamiltonian dynamics to sample from some distribution of interest by defining 
the potential energy function to be minus the log of the density function of this distribution 
(plus any constant). The position variables, g, then correspond to the variables of interest. We 
also introduce fictitious momentum variables, p, of the same dimension as q, which will have a 
distribution defined by the kinetic energy function. The joint density of q and p is defined by the 
Hamiltonian function as 

P{q,p) = ^ exp(^- H{q,p)^ 

When H{q,p) — U{q) + K{p), as we assume in this paper, we have 

P{q,p) = ^ exp(^- [/(g)) exp(^- ii'(p)) 

so q and p are independent. 

In applications to Bayesian statistics, q consists of the model parameters (and perhaps latent 
variables), and our objective is to sample from the posterior distribution for q given the observed 
data D. To this end, we set 

U{q) = -log{P{q)Liq\D)) 

where P{q) is our prior and L{q\D) is the likelihood function given data D. 

Having defined a Hamiltonian function corresponding to the distribution of interest (e.g., a 
posterior distribution of model parameters), we could in theory use Hamilton's equations, ap- 
plied for some specified time period, to propose a new state in the Metropolis algorithm. Since 
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Hamiltonian dynamics leaves invariant the value of H (and hence the probability density), and 
preserves volume, this proposal would always be accepted. (For a more detailed explanation, see 



Neal| ( [2010| ).) 

In practice, however, solving Hamiltonian's equations exactly is too hard, so we need to ap- 
proximate these equations by discretizing time, using some small step size e. For this purpose, the 
leapfrog method is commonly used. It consists of iterating the following steps: 

dU 

Pjit + e/2) = p,{t) - (e/2) — (g(t)) 

dK 

Qjit + e) = q,{t) + t—{p{t + t/2)) (3) 

dU 

Pjit + e) = Pjit + e/2) - (e/2) — {q{t + e)) 

Typically, K{p) = p'^M~^p / 2, with M usually being a diagonal matrix with elements mi, . . . , m^, 
so that K{p) = ^iPi /2mi. The pj are then independent and Gaussian with mean zero, with pj 
having variance mj. The time derivative of qj is then dK/dpj = pj/rrij. 

We can use some number, L, of these leapfrog steps, with some stepsize, e, to propose a new 
state in the Metropolis algorithm. We apply these steps starting at the current state {q,p), with 
fictitious time set to t = 0. The final state, at time t = Le, is taken as the proposal, {q*,p*)- (To 
make the proposal symmetric, we would need to negate the momentum at the end of the trajectory, 
but this can be omitted when, as here, K{p) = K{—p) and p will be replaced (see below) before the 
next update.) This proposal is either accepted or rejected (with the state remaining unchanged), 
with the acceptance probability being 

min[l, exp(-if (g*,p*) + H{q,p))] = min[l, exp{-U{q*) + U{q) - K{p*) + K{p))] 

These Metropolis updates will leave H approximately constant, and therefore do not explore 
the whole joint distribution of q and p. The HMC method therefore alternates these Metropolis 
updates with updates in which the momentum is sampled from its distribution (which is inde- 
pendent of q). When K{p) = J^iPl f^^iy ^ach pj is sampled independently from the Gaussian 
distribution with mean zero and variance rrij. 

As an illustration, consider sampling from the following bivariate normal distribution 

/3\ /I 0.95 

q ~ N{fi, S), with = I „ I and S 
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Figure 1: Comparison of Hamiltonian Monte Carlo (HMC) and Random Walk Metropolis (RWM) 
when applied to a bivariate normal distribution. Left plot: The first 30 iterations of HMC with 
20 leapfrog steps. Right plot: The first 30 iterations of RWM with 20 updates per iterations. 



For HMC, we set L = 20 and e = 0.15. The left plot in Figure [T] shows the first 30 states 
from an HMC run. The density contours of the bivariate normal distribution are shown as gray 
ellipses. The right plot shows every 20*'^ state from the first 600 iterations of a run of a simple 
random walk Metropolis (RWM) algorithm. (This takes time comparable to that for the HMC 
run.) The proposal distribution for RWM is a bivariate normal with the current state as the 
mean, and 0.15^/2 as the covariance matrix. (The standard deviation of this proposal is the same 
as the stepsize of HMC.) Figure [T] shows that HMC explores the distribution more efficiently, 
with successive samples being further from each other, and autocorrelations being smaller. For 
an extended review of HMC, its properties, and its advantages over the simple random walk 



Metropolis algorithm, see Neal (2010). 

In this example, we have assumed that one leapfrog step for HMC (which requires evaluating 
the gradient of the log density) takes approximately the same computation time as one Metropolis 
update (which requires evaluating the log density), and that both move approximately the same 



distance. The benefit of HMC comes from this movement being systematic, rather than in a 
random walk. We now propose a new approach called Split Hamiltonian Monte Carlo (Split 
HMC), which further improves the performance of HMC by modifying how steps are done, with 
the effect of reducing the time for one step or increasing the distance that one step moves. 



2 Splitting the Hamiltonian 



As discussed by Neal (2010), variations on HMC can be obtained by using discretizations of 



Hamiltonian dynamics derived by "splitting" the Hamiltonian, H, into several terms: 

H{q,p) = Hi{q,p) + H2{q,p) + ■■■ + HK{q,p) 

We use Ti^t, for i = 1, . . . ,k to denote the mapping defined by Hi for time t. Assuming that 
we can implement Hamiltonian dynamics Hk exactly, the composition Ti ,, o T2^e o . . . o T^, ^ is a 



valid discretization of Hamiltonian dynamics based on H if Hi are twice differentiable (Leimkuhler 



and Reich, 2004). This discretization is symplectic and hence preserves volume. It will also be 



reversible if the sequence of H^ are symmetric: Hi{q,p) = HK-i+i{q,p)- 

Indeed, the leapfrog method ^ can be regarded as a symmetric splitting of the Hamiltonian 

H{q,p) = U{q) + K{p) as 



H{q,p) = U{q)/2 + K{p) + U{q)/2 



(4) 



In this case, Hi{q,p) = H^{q,p) = U{q)/2 and H2{q,p) = K{p). Hamiltonian dynamics for Hi is 

dHi 



dt 

dpj 
~dt 



dpj 

_dHi 
dqj 







1 dU 

2 % 



which for a duration of e gives the first part of a leapfrog step. For H2, the dynamics is 

dqj _ dH2 _ OK 
dt dpj dpj 

dpj _ dH2 _ Q 
dt dqj 

For time e, this gives the second part of the leapfrog step. Hamiltonian dynamics for H^ is the 
same as that for Hi since Hi = H^, giving the the third part of the leapfrog step. 
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2.1 Splitting the Hamiltonian when a partial analytic solution 
is available 

Suppose the potential energy U{q) can be written as Uo{q) + Ui{q). Then, we can spht H as 

H{q,p) = U,{q)/2 + [Uo{q) + Kip)] + U,{q)/2 (5) 

Here, Hi{q,p) = H^{q,p) = f/i(g)/2 and H2{q,p) = Uo{p) + K{j>). The first and the last terms in 
this splitting are similar to Eq. Q, except that Ui{q) replaces U{q)^ so the first and the last part 
of a leapfrog step remain as before, except that we use Ui{q) rather than U{q) to update p. Now 
suppose that the middle part of the leapfrog, which is based on the Hamiltonian UQ{q) + K{p), can 
be handled analytically; that is, we can compute the exact dynamics for any duration of time. We 
hope that since this part of the simulation introduces no error, we will be able to use a larger step 
size, and hence take fewer steps, reducing the computation time for the dynamical simulations. 

We are mainly interested in situations where f/o(g) provides a reasonable approximation to 
[/(g), and in particular on Bayesian applications, where we approximate U by focusing on the 
the posterior mode, g, and the second derivatives of U at that point. We can obtain q using 
fast methods such as the Newton-Raphson, when analytical solutions are not available. We then 
approximate U{q) with Uo{q), the energy function for N{q, J'^^{q)), where J^{q) is the Hessian 
matrix of U at q. Finally, we set Ui{q) = U{q) — Uo{q), the error in this approximation. 



Beskos et al. (2011) have recently proposed a similar splitting strategy for HMC, in which a 
Gaussian component is handled analytically, in the context of high-dimensional approximations 
to a distribution on an infinite-dimensional Hilbert space. In such applications, the Gaussian 
distribution will typically be derived from the problem specification, rather than being found as a 
numerical approximation, as we do here. 

Using a normal approximation in which Uo{q) = \{q — q)^J{q){q — q), and letting K{p) = \p^p 



(the energy for the standard normal distribution), H2{q,p) = U^i^q) + K{p) in Eq. (([5])) will be 
quadratic, and Hamilton's equations will be a system of first-order linear differential equations 
that can be handled analytically (Polyanin et al. , 2002). Specifically, setting q* = q — q, the 



dynamical equations can be written as follows: 



d 


q*{t) 




/ 




q*{t) 


di 


Pit) 




-J{q) 




Pit) 
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where / is the identity matrix. Defining X 
where 



(g, p), this can be written as —X{t) = AX{t), 



A 



/ 

The solution of this system is X(t) = e^^ Xq, where Xq is the initial value at time t = 0. This 
can be simplified by diagonalizing the coefficient matrix A as 

A = TDT-^ 

where F is invertible and D is a diagonal matrix. The system of equations can then be written as 

d 



dt 



VDV-'X{t) 



Now, let Y{t) = T-^X{t). Then, 



d_ 
di 



Y{t) = DY{t) 



The solution for the above equation is Y{t) = e^^Yo, where Yq = T ^Xq. Therefore, 

X{t) = Te^^T-^Xo 

and e^* can be easily computed by simply exponentiating the diagonal elements of D times t. 

The above analytical solution is of course for the middle part (denoted as H2) of Eq. ^ only. 
We still need to approximate the overall Hamiltonian dynamics based on H, using the leapfrog 
method. Algorithm 1 shows the corresponding leapfrog steps — after an initial step of size e/2 
based on Ui{q), we obtain the exact solution for a time step of e based on H2{q,p) = Uo{q) +K{p), 
and finish by taking another step of size e/2 based on Ui{q). 



2.2 Splitting the Hamiltonian by splitting the data 

The method discussed in the previous section requires that we be able to handle the Hamiltonian 
H2{q,p) = Uo{q) + K{p) analytically. If this is not so, splitting the Hamiltonian in this way may 
still be beneficial if the computational cost for Uo{q) is substantially lower than for U{q). In these 
situations, we can use the following split: 

M 

H{q,p) = U,iq)/2+Y,[Uo{p)/2M + K{p)/M + Uo{p)/2M] + U,{q)/2 (6) 

m=l 
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Algorithm 1: Leapfrog for split Hamiltonian 
Monte Carlo with a partial analytic solution. 



Algorithm 2: Nested leapfrog for split Hamil- 
tonian Monte Carlo with splitting of data. 



R ^ Te^T-i 

Sample initial values for p from A^(0, /) 
for £ = 1 to L do 



q* ^ q — q 
Xo^{q*,p) 
iq*,p)^RXo 
q ^ q* + q 
p^p- (e/2) 
end for 



dq 



dUi 
dq 



Sample initial values for p from A^(0, /) 
for £ = 1 to L do 

P^P - (e/2)- 



dq 

for m = 1 to M do 

dUo 



p<-p- (e/2M) 
q ^ q + {^/M)p 
p^p - (e/2M) 



end for 

p^p- (e/2) 
end for 



dq 

dUo 
dq 



dUi 
dq 



for some M > 1. The above discretization can be considered as a nested leapfrog, where the outer 
part takes half steps to update p based on Ui alone, and the inner part involves M leapfrog steps 
of size e/M based on Uq. Algorithm 2 implements this nested leapfrog method. 

For example, suppose our statistical analysis involves a large data set with many observations, 
but we believe that a small subset of data is sufficient to build a model that performs reasonably 
well (i.e., compared to the model that uses all the observations). In this case, we can construct 
[/o(q') based on a small part of the observed data, and use the remaining observations to construct 
Ui{q). That is, we divide the observed data, y, into two subsets: Rq, which is used to construct 
Uo{q), and Ri, which is used to construct Ui. 

u{e) = Uoie) + u,{e) 

Uoie) = -log[P(^^)]-5^1og[P(i/.|^^)] (7) 

ieRo 

UliO) = - 5^ log[P(i/,|^)] 

i'eRi 

Note that the prior appears in Uo{9) only. 



Neal (2010) discusses a related strategy for splitting the Hamiltonian by splitting the observed 



data into multiple subsets. However, instead of randomly splitting data, as proposed there, here 




~r 



X1 



Figure 2: An illustrative binary classification problem with n = 100 data points and two covariates, 
xi and X2, with the two classes represented by white circles and black squares. 



we split data by building an initial model based on the maximum a posterior (MAP) estimate, q, 
and use this model to identify a small subset of data that captures most of the information in the 
full data set. We next illustrate this approach for logistic regression models. 

3 Application of Split HMC to logistic regression models 

We now look at how Split HMC can be applied to Bayesian logistic regression models for binary 
classification problems. We will illustrate this method using the simulated data set with n = 100 
data points and p = 2 covariates that is shown in Figure |2} 

The logistic regression model assigns probabilities to the two possible classes (denoted by 
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and 1) in case i (for i = 1, . . . , n) as follows: 

1 + exp(a + XI (3) 

Here, Xi is the vector of length p with the observed values of the covariates in case z, a is the 
intercept, and {3 is the vector of p regression coefficients. We use 6 to denote the vector of all p + 1 
unknown parameters, (a, (3). 

Let P{0) be the prior distribution for 9. The posterior distribution of 9 given x and y is 
proportional to P{9) nr=i P{yi\^ii The corresponding potential energy function is 

n 

U{9) = -log[P(^^)] - J]log[P(?/,|x„^)] 

i=l 

We assume the following (independent) priors for the model parameters: 

~ iV(0,a2), j = l,...,p 

where cTq, and o"^ are known constants. 

The potential energy function for the above logistic regression model is therefore as follows: 

2 V o2 n 

^(^) = ^ + E|t - 5^ [?/.(« + xf/3)-log(l + exp(a + xf/3)) 

The partial derivatives of the energy function with respect to a and the (3j are 

dU a r exp(a + xj(3) i dU (3j r exp(a + xjl3) ' 

^ ~ l + exp(« + xf/3)J' W3 ^ ^~ ^""^'r^" l + exp(« + xf/3). 

To apply Algorithm 1 for Split HMC to this problem, we approximate the potential energy 
function U{9) for the logistic regression model with the potential energy function Uq{9) of the 
normal distribution N{9, J'^^{9)), where 9 is the MAP estimate of model parameters. Uo{9) 
usually provides a reasonable approximation to U{9), as illustrated in Figure |3} In the plot on the 
left, the solid curve shows the value of the potential energy, U, as /3i varies, with /32 and a fixed 
to their MAP values, while the dashed curve shows Uq for the approximating normal distribution. 
The right plot of Figure |3] compares the partial derivatives of U and Uq with respect to (3i, showing 
that dUo/dPj provides a reasonable linear approximation to dU /d(3j. 



11 



12 3 4 

Pl 

Figure 3: Left plot: The potential energy, U, for the logistic regression model (solid curve) and 
its normal approximation, Uq (dashed curve), as Pi varies, with other parameters at their MAP 
values. Right plot: The partial derivatives of U and Uq with respect to 

Since there is no error when solving Hamiltonian dynamics based on Uq{6), we would expect 
that the total discretization error of the steps taken by Algorithm 1 will be less that for the 
standard leapfrog method, for a given stepsize, and that we will therefore be able to use a larger 
stepsize — and hence need fewer steps for a given trajectory length — while still maintaining a 
good acceptance rate. The stepsize will still be limited to the region of stability imposed by the 
discretization error from Ui = U — Uq, but this limit will tend to be larger than for the standard 
leapfrog method. 

To apply Algorithm 2 to this logistic regression model, we split the Hamiltonian by splitting 
the data into two subsets. Consider the illustrative example discussed above. In the left plot of 
Figure |4| the thick line represents the classification boundary using the MAP estimate, 6. For the 
points that fall on this boundary line, the estimated probabilities for the two groups are equal, both 
being 0.5. The probabilities of the two classes become less equal as the distance of the covariates 
from this line increases. We will define Uq using the points within the region, Rq, within some 
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X1 Pl 

Figure 4: Left plot: A split of the data into two parts based on the MAP model, represented by 
the solid line; the energy function U is then divided into f/o, based on the data points in Rq, and 
Ui, based on the data points in Ri. Right plot: The partial derivatives of U and Uq with respect 
to Pl, with other parameters at their MAP values. 



distance of this line, and define Ui using the points in the region, at a greater distance from 
this line. This split can also be viewed as putting in i?o those points for which the prediction of 
the class based on the MAP parameter estimate has high entropy, with points where this entropy 
is lower going to Ri. (The entropy, which for a binary class with probabilities -P(O) and -P(l) is 
defined as -P(O) log(l/P(0)) + -P(l) log(l/P(l)), declines monotonically as the class probabilities 
move away from 0.5.) 

The shaded area in Figure |4] shows the region, Rq, containing the 30% of the observations 
closest to the MAP line, or equivalently the 30% of observations with the highest entropy for the 
class prediction, or the 30% of observations for which the probability of class 1 is closest (in either 
direction) to 0.5. The unshaded region containing the remaining data points is denoted as Ri. 
Using these two subsets, we can split the energy function U{6) into two terms: Uo{6) based on the 
data points that fall within Rq, and Ui based on the data points that fall within Ri (see Eq. ([T])). 
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Then, we use Eq. (|6]) to split the Hamiltonian dynamics. 

Note that Uq is not used to approximate the potential energy function, U, for the acceptance 
test at the end of the trajectory. Rather, dUo/df3j is used to approximate dU/d(3j, which is the 
costly computation when we simulate Hamiltonian dynamics. Recall that 

dU 13 j r exp(a + xf (3) ~ 

Wj ^ ^ ~ Z^^^n^^~l + exp(« + xf/3). 

For a given (a, (3), the term exp(a; + xff3)/ (1 + exp(Q; + xj(3)) is in fact P{yi = l\xi, a, (3). For 
high entropy data points close to the boundary line defined based on (a, this estimated proba- 
bility is close to 0.5, so Ui — exp(Q; + xjf3)/{l + exp(a + xjf3)) tends to be large. (Note that i/i is 
either zero or one.) This term is also large for misclassified cases. Such cases, however, are likely 
to be close to the boundary line too. Therefore, for a given (a, the value of dU/dPj is mainly 
dominated by high entropy data points (at least when 9 is close to 9). The right plot of Figure [4] 
shows the approximation of dU /df3j by dUo/d(3j. 



4 Experiments 



In this section, we use simulated and real data to compare our proposed methods to the standard 
HMC. 

For each problem, we set the number of leapfrog steps to L = 20 for the standard HMC, and 



find e such that the acceptance probability (AP) is close to 0.65 (Neal, 2010). We set L and e for 
the Split HMC methods such that the trajectory length, eL, remains the same, but with a larger 
stepsize and hence a smaller number of steps. We try to choose e for the Split HMC methods such 
that the acceptance probability is equal to that of standard HMC. However, increasing the stepsize 
beyond a certain point leads to instability of trajectories, in which the error of the Hamiltonian 



grows rapidly with L (Neal, 2010), so that proposals are rejected with very high probability. This 
sometimes limits the stepsize of Split HMC to values at which the acceptance probability is greater 
than the 0.65 aimed at for standard HMC. 

When splitting the data in order to use Split HMC Algorithm 2, we always use the 30% of the 
data with highest entropy MAP class predictions (ie, class probabihties closest to 0.5) to define 
Uq. We always set the number of inner leapfrog steps in Algorithm 2 to M = 3. 
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To measure the efficiency of each samphng method, we use the autocorrelation time (ACT) 
by dividing the posterior samples into batches of size B, and estimating ACT as follows ( Neal[ 



1993j[Geyer[[1992j): 

^ - 4 



Here, S is the sample variance and 5*^ is the sample variance of batch means. Following Thomp- 



son 



(2010), we divide the posterior samples into N^^^ batches of size B = N"^^^. Throughout 



this section, we set the number of Markov chain Monte Carlo (MCMC) iterations for simulating 
posterior samples to A^ = 100000. 

The autocorrelation time can be roughly interpreted as the number of MCMC transitions 
required to produce samples that can be considered as independent. For the logistic regression 
problems discussed in this section, we find the autocorrelation times for each regression parameter 
separately, and then compare different methods using the maximum autocorrelation time (i.e., the 
autocorrelation time of the slowest parameter in the chain), denoted as Tmax, over all regression 
parameters. 

We adjust Tmax to account for the varying computation time needed by the different methods 
in two ways. One is to compare different methods using Tmax x where s is the CPU time per 
iteration, using an implementation written in R. This measures the CPU time required to produce 
samples that can be regarded as independent samples. We also compare in terms of Tmax x 9, where 
g is the number of gradient computations on the number of cases in the full data set required for 
each trajectory simulated by HMC. This will be equal to the number of leapfrog steps, L, for 
standard HMC or Split HMC using a normal approximation. When using data splitting with a 
fraction / of data in Rq and M inner leapfrog steps, g will be (/M + (1 — /)) x L, which is 1.6L 
when / = 0.3 and M = 3, as for these experiments. In general, we expect that computation time 
will be dominated by the gradient computations counted hy g, so that Tmax x 9 will provide a 
measure of performance independent of any particular implementation. In our experiments, s was 
close to being proportional to g, except for slightly larger than expected times for Split HMC with 
data splitting. 
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HMC 


Split HMC 






Normal Appr. 


Data Splitting 


L 


20 


i i 


1 1 
1 i 


a 


20 


1 1 
i i 




s 


0.401 


998 




AP 


0.58 


0.73 


0.70 


''"max 


281.7 


170.0 


212.2 


''"max ^ 9 


5634 


1870 


3735 


''"max '5 


113.03 


38.78 


78.57 



Table 1: Split HMC (with normal approximation and data splitting) compared to standard HMC 
using simulated data, on a data set with n = 10000 observations and p = 100 covariates. 

4.1 Simulated data 

We first tested the methods on a simulated data set with 100 covariates and 10000 observations. 
The covariates were sampled as Xij ~ A^(0, o"|), for z = 1, . . . , 10000 and j = 1, . . . , 100, with aj 
set to 5 for the first five variables, to 1 for the next five variables, and to 0.2 for the remaining 90 
variables. We sampled true parameter values, a and Pj, independently from A^(0, 1) distributions. 
Finally, we sampled the class labels according to the model, as yt ~ Bernoulli(pi) with logit(pi) = 
a + xjl3. 

For the Bayesian logistic regression model, we assumed normal priors with mean zero and 
standard deviation 5 for a and Pj, where j = 1, . . . , 100. We ran standard HMC, Split HMC with 
normal approximation, and Split HMC with data splitting for = 100000 iterations. For the 
standard HMC, we set L = 20 and e = 0.02, so the trajectory length was 20 x 0.015 = 0.3. For 
Split HMC with normal approximation and Split HMC with data splitting, we reduce the number 
of leapfrog steps to 11, while increasing the stepsizes so that the trajectory length remained 0.3. 

Table [l] shows the results for the three methods. The CPU times (in seconds) per iteration, s, 
and Tmax X s for the Split HMC methods are substantially lower than for standard HMC. Especially, 
the Split HMC with normal approximation is almost three times faster compared to the standard 
HMC, in terms of generating samples that can be considered independent. The comparison is 
similar looking at Tmax x g. 
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4.2 Results on real data sets 

In this section, we evaluate our proposed method using four real binary classification prob- 
lems. The data for these four problems are available from the UCI Machine Learning Repository 



(http://archive.ics.uci.edu/inl/index.htinl). For all data sets, we standardized the numeri 



cal variables to have mean zero and standard deviation 1. Further, we assumed normal priors with 
mean zero and standard deviation 5 for the regression parameters. We used the setup described 
at the beginning of Section |4| running each Markov chain for N = 100000 iterations. Table [2] 
summarizes the results using the three sampling methods. 

The first problem, StatLog, involves using multi-spectral values of pixels in a satellite image 
in order to classify the associated area into soil or cotton crop. (In the original data, different 
types of soil are identified.) The sample size for this data set is n = 4435, and the number of 
features is p = 37. For the standard HMC, we set L = 20 and e = 0.075. For the two Split HMC 
methods with normal approximation and data splitting, we reduce L to 14 and 8 respectively while 
increasing e so e x L remains the same as that of the standard HMC As seen in the table, the 
Split HMC methods substantially reduce t^^x x s (and Xmax x g), with the data splitting method 
performing somewhat better than the normal approximation method. 



The second example, Gisette, is a handwritten digit recognition problem (Guyon et al. , 2004) 



constructed based on the MNIST data (http://yann.lecun.com/exdb/mnistl). The objective is 



to separate the highly confusable digits "4" and "9" . The data include n = 6000 observations and 
a set of 5000 highly sparse features. Here, we use the projections of original covariates onto the 
first 100 principal components as predictors. We set L = 20 and e = 0.11 for the standard HMC. 
We reduced the number of leapfrog steps to 14 and 11 for Split HMC with normal approximation 
and data splitting respectively. Again, a substantial improvement is acheived over standard HMC, 
with the normal approximation method performing best on this data set. 

The third problem, CTG, involves analyzing 2126 fetal cardiotocograms along with their re- 



spective diagnostic features (de Campos et al. , 2000). The objective is to determine whether the 



fetal state class is "pathologic" or not. The data include 2126 observations and 21 features. For 
the standard HMC, we set L = 20 and e = 0.075. We reduced the number of leapfrog steps to 14 
and 8 for Split HMC with normal approximation and data splitting respectively. Both zsplitting 
methods improved performance significantly. 
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ilMU 


Split HMC 














Normal Appr. 


Data Splitting 


StatLog 




L 






20 


14 


8 


n = 4435, p 


= 37 


9 






20 


14 


12.8 






s 






0.055 


0.040 


0.039 






AP 






0.63 


0.70 


0.78 






Tmax 






11.0 


10.4 


8.2 








X 


9 


220 


146 


105 








X 


s 


0.61 


0.41 


0.32 


Gisette 




L 






20 


14 


11 


n = 600, p = 


= 100 


9 






20 


14 


17.6 






s 






0.233 


0.169 


0.201 






AP 






0.70 


0.85 


0.78 












21.0 


20.7 


20.6 








X 


9 


420 


290 


363 






Tmax 


X 


s 


4.88 


3.50 


4.11 


CTG 




L 






20 


14 


8 


n = 2126, p 


= 21 


9 






20 


14 


12.8 






s 






0.016 


0.012 


0.013 






AP 






0.65 


0.82 


0.79 






''"max 






64.1 


54.1 


54.6 






''"max 


X 


9 


1282 


757 


699 






''"max 


X 


s 


1.03 


0.63 


0.71 


Chess 




L 






20 


13 


9 


n = 3196, p 


= 36 


9 






20 


13 


14.4 






s 






0.035 


0.024 


0.031 






AP 






0.69 


0.82 


0.84 






''"max 






28.4 


25.2 


36.6 






^max 


X 


9 


568 


328 


527 






^max 


X 


s 


1.00 


0.61 


1.12 



Table 2: HMC and Split HMC (normal approximation and data splitting) on four real data sets. 
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The objective of the last problem, Chess, is to predict chess endgame outcomes — either 
"white can win" or "white cannot win". This data set includes n = 3196 instances, where each 
instance is a board-descriptions for the chess endgame. There are p = 3Q attributes describing the 
board. For the standard HMC, we set L = 20 and e = 0.08. For the two Split HMC methods with 
normal approximation and data splitting, we reduced L to 13 and 9 respectively. Using the normal 
approximation method, Tmax x s and Tmax x 9 were substantially reduced compared to standard 
HMC. For the Split HMC with data splitting, however, the reduction in s and g coincides with 
an increase in Tmax so there was little or no improvement compared to standard HMC. The failure 
of data splitting on this problem is puzzling. (One possible explanation might be that the overall 
high acceptance probability with this method might conceal a low acceptance probability in some 
regions of the parameter space.) 



5 Discussion 

We have proposed two new methods for improving the efficiency of HMC, both based on splitting 

the Hamiltonian in a way that allows much of the movement around the state space to be performed 

at low computational cost. 

While we have focused on the application of our methods to logistic regression models for 

binary classification problems, they can be extended to more complex models, such as multinomial 

logistic (MNL) models for multiple classess. For MNL models, the regression parameters for p 

covariates and J classes form a matrix of {p + 1) rows and J columns, which we can vectorize 

so that the model parameters, 6, become a vector of (p + 1) x J elements. For Split HMC with 

normal approximation, we define Uo{9) using an approximate multivariate normal N{9, J'^^{9)) 

as before. For Split HMC with data splitting, we still construct Uo{9) using a small subset of data 

with the highest entropy class distributions, with the entropy is calculated as 

J 

J2P{y = j\x,9) \og[l/P{y=j\x,9)] 

The data splitting method could be further extended to any model for which it is feasible to find 
a MAP estimate, and then divide the data into two parts based on "residuals" of some form. 



We (Lan and Shahbaba, 2011) have recently extended our Split HMC method to Gaussian 



process models for regression and classifications problems. These nonparametric Bayesian models 
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tend to be computationally intensive, with the computational cost being a major factor limiting 
their application. We have shown that splitting the Hamiltonian function can reduce the com- 
putational cost (measured in terms of Tmax x s) of sampling from the posterior distribution of 
hyperparameters by up to 10 fold, by handling a quadratic part of the Hamiltonian analytically. 

Future research in this area could involve finding tractable approximations to the posterior 
distribution other than normal distributions. One could also investigate other methods for splitting 
the Hamiltonian dynamics by splitting the data — for example, fitting a support vector machine 
(SVM) to binary classification data, and using the support vectors for constructing Uq. 

While the results on simulated data and real classification problems presented in this paper 
have demonstrated the advantages of splitting the Hamiltonian dynamics in terms of improving 
the sampling efficiency, our proposed methods do require preliminary analysis of data, mainly, 
finding the MAP estimate. The performance of our approach obviously depends on how well the 
corresponding normal distribution based on MAP estimates approximates the posterior distri- 
bution, or how well a small subset of data found using this MAP estimate captures the overall 
patterns in the whole data set. This preliminary analysis involves some computational overhead, 
but the computational cost associated with finding the MAP estimate is often negligible compared 
to the potential improvement in sampling efficiency for the full Bayesian model. 



Another approach to improving HMC has recently been proposed by Girolami and Calderhead 



(2011). Their method, Riemannian Manifold HMC (RMHMC), can also substantially improve 
performance. RMHMC utilizes the geometric properties of the parameter space to explore the 
best direction, typically at higher computational cost, to produce distant proposals with high 
probability of acceptance. In contrast, our method attempts to find a simple approximation to 
the Hamiltonian to reduce the computational time required for reaching distant states. It is 
possible that these approaches could be combined, to produce a method that performs better 



than either method alone. The recent proposals of Hoffman and Gelman (2011) for automatic 
tuning of HMC could also be combined with our Split HMC methods. 
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